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Abstract 

Gene Set Enrichment Analysis (GSEA) and its variations aim to discover collections 
of genes that show moderate but coordinated differences in expression. However, such 
techniques may be ineffective if many individual genes in a phenotype-related gene set 
have weak discriminative power. A potential solution is to search for combinations of 
genes that are highly differentiating even when individual genes are not. Although such 
techniques have been developed, these approaches have not been used with GSEA to 
any significant degree because of the large number of potential gene combinations and 
the heterogeneity of measures that assess the differentiation provided by gene groups 
of different sizes. 

To integrate the search for differentiating gene combinations and GSEA, we propose 
a general framework with two key components: (A) a procedure that reduces the 
number of scores to be handled by GSEA to the number of genes by summarizing 
the scores of the gene combinations involving a particular gene in a single score, and 
(B) a procedure to integrate the heterogeneous scores from combinations of different 
sizes and from different gene combination measures by mapping the scores to p- values. 
Experiments on four gene expression data sets demonstrate that the integration of 
GSEA and gene combination search can enhance the power of traditional GSEA by 
discovering gene sets that include genes with weak individual differentiation but strong 
joint discriminative power. Also, gene sets discovered by the integrative framework 
share several common biological processes and improve the consistency of the results 
among three lung cancer data sets. 



"Corresponding author; Supplementary material: http://vk.cs.umn.edu/ICG/ 
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1 Introduction 



Microarray technology is an important tool to monitor gene-expression in bio-medical studies 
[27] . A common experimental design is to compare two sets of samples with different pheno- 
types, e.g. diseased and normal tissue, with the goal of discovering differentially expressed 
genes [TB] . Statistical testing procedures, such as such as the t-test and significance analysis 
of microarrays [31], have been extensively studied and widely used. Subsequently, multiple 
testing corrections are usually applied A comprehensive review of such approaches are 
presented in [25] . 

Differential expression analysis based on univariate statistical tests has several well-known 
limitations. First, due to the low sample size, high dimensionality and the noisy nature of 
microarray data, individual genes may not meet the threshold for statistical significance after 
a correction for multiple hypotheses testing [28] . Second, the lists of differentially expressed 
genes discovered from different studies on the same phenotype have little overlap [28] . 

These limitations motivated the creation of Gene Set Enrichment Analysis GSEA [2^1128] , 
which discovers collections of genes, for example, known biological pathways [28], that show 
moderate but coordinated differentiation. For example, Subramanian and Tamayo et al. [28] 
report that the p53 hypoxial pathway contains many genes that show moderate differentia- 
tion between two lung cancer sample groups with different phenotypes. Although the genes 
in the pathway are not individually significant after multiple hypothesis correction |28j, the 
pathway is. For those familiar with GSEA and its output, Figure [1] shows the GSEA re- 
sults for the p53 hypoxial pathway. GSEA also has the advantages of better interpretability 
and better consistency between the results obtained by different studies on the same phe- 
notype [28]. Ackermann and Strimmer presented a comprehensive review of different GSEA 
variations in pp. 

Unfortunately, GSEA and related techniques may be ineffective if many individual genes 
in a phenotype-related gene set have weak discriminative power. A potential solution to 
this problem is to search for combinations of genes that are highly differentiating even when 
individual genes are not. For this approach, the targets are groups of genes that show 



much stronger discriminative power when combined together [TU]. For example, Figure 2 (a} 



illustrates one type of differentially expressed gene combination discovered in [10]. The two 
genes have weak individual differentiation indicated by the overlapping class symbols on 
both the two axes. In contrast, these two genes are highly discriminative in a joint manner 
indicated by the different correlation structure in the two-dimensional plot, i.e. they are 
correlated along the blue and red dashed line respectively in the triangle and circle class. 
Such a joint differentiation may indicate that the interaction of the two genes is associated 
with the phenotypes even though the two genes, individually, are not. 

Figure 2(b) illustrates another type of pheno type-associated gene combination discovered 
in [10] , usually named differential coexpression [Hi [12] , in which the correlation of the two 
genes are high in one class but much lower in the other class. As discussed in [JJjJ ITo] , 
existing multivariate tests such as Hotelling's T 2 [23j, Dempster's Tl [9 J are not suitable to 
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Figure 1: A gene set (p53 hypoxial pathway) with many moderate but coordinated differ- 
ential expression (towards the right tail of the ranked list). Figure generated with GSEA 
software [28J 

detect such 'complementary' gene combinations because they only screen for differences in 
the multivariate mean vectors, and thus will favor pairs that consist of genes with strong 
marginal effects by themselves but not the genes like the four in Fig. [21 For clarification, 
we use differential gene combination search (denoted as DGCS) to refer to the multivariate 
data analyses that are designed to detect the complementarity of different genes, rather 
than those designed to model the correlation structure of different genes (such as Hotelling's 
T 2 and Dempster's Tl test). A variety of other DGCS measures for complementary gene 
combination search are proposed for gene pairs [20) 19] in addition to the two illustrated 
in figure [2J Several measures are designed for higher-order gene combination beyond pairs 
[T2"| 151] . These approaches can provide biological insights beyond univariate gene analysis 
as shown in pU |T21 [5J] . 

The limitations of GSEA and the capabilities of DGCS motivate a GSEA approach using 
gene combinations in which the score of a gene set is based on both the scores of individual 
genes in the set and the scores from the gene combinations in which these genes participate. 
Unfortunately, gene combination techniques have not been used with the GSEA approach 
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Figure 2: Two highly differential gene-pairs with weak individual discriminative power. Axes 
indicate the expression level of indicated genes. Different color and shape of markers indicates the 
two phenotypes. Figures modified from Dettling et al. [10J. This two types of differential gene 
combinations are measured respectively by two measures M\ and M<i as described in section [2J 



in any significant way because of two key challenges. 

1. Finding a technique to reduce the vast number of gene combinations. There 
are exponentially more gene combinations than individual genes, i.e. in addition to the 
N univariate genes, there are N 2 gene-pairs, iV 3 gene-triplets, etc. Many variations of 
GSEA are based on a ranked list of the N individual genes as illustrated in Figure [U 
Including combinations in the ranked list might work for size-2 combinations [3 [35] , 
but would not be feasible for handling gene combinations of larger sizes. Furthermore, 
this explosion in the number of gene combinations negatively impacts false discovery 
rates. Thus, by adding so many gene combinations, we run the risk that neither groups 
of genes nor individual genes will show statistically significant differentiation. 

2. Combining results from the heterogeneous measures used to score different 
size gene combinations. Furthermore, because a gene can be associated with the 
phenotype either as an univariate variable or together with other genes as a combina- 
tion, the importance of a gene set should be based on both the univariate gene scores 
and the gene combination based scores of its set members. However, different measures 
have a different nature, scale and significance, and thus are not directly comparable (to 
be detailed in section 13^2]) . Indeed, differences exist even between gene combinations of 
the same measure but of different sizes. Therefore, the challenge lies in how to design 
a framework to combine different measures (a univariate measurd3 plus one or more 

^Ackermann and Strimmer pQ suggest that different univariate statistics have similar effect in GSEA and 
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DGCS measures) together within the GSEA framework. 



To the best of our knowledge, no existing work has sufficiently addressed these two chal- 
lenges, although recent work presented in [7J |35] have made initial efforts at adding GSEA 
capabilities to gene combination search. More specifically, two approaches are proposed in 
[7J [35] to help the study of a specific type of size-2 differential combinations as illustrated 



in 



2(b) The experiments in these two studies provide some evidence about the benefits of 



the integration. However, a more general framework is needed that can also handle other 



types of size-2 differential combinations as illustrated in figure 2(a), higher order differen- 
tial combinations (e.g. SDC[12] and the n-statistic [S]), and multiple types of differential 
combinations. 

Contributions: In this paper, we propose a general framework to address the above 
challenges for the effective integration of DGCS and GSEA. Specific contributions are as 
follows: 

1. A gene-combination-to-gene score summarization procedure (procedure A) 
that is designed to handle the exponentially increasing number of gene 
combinations. First, for a given gene combination measure and a certain k, the score 
of a size-k combination is partitioned into k equal parts which are assigned to each of 
the k genes in the combination. Because each gene can participate in up to (f^Zi) size-/c 
combinations, each gene will be assigned with a score from each of these combinations. 
Secondly, an aggregation statistic, e.g., maximum absolute value is used to summarize 
the different scores for a gene. With such a procedure, scores for all the size-/c gene 
combinations are summarized to iV scores for N genes. This procedure can effectively 
retain the O(N) length of the ranked list while handling gene combinations of size-k 
(k > 2). 

2. A score-to-pvalue transformation and summarization procedure (procedure 
B) that is designed to integrate the scores contributed (in procedure A) 
from different gene combination measures and from gene combinations of 
different sizes. The transformation is based on p- values obtained from scores derived 
from phenotype permutations. Such a transformation enables the comparison of scores 
from different measures (either univariate or gene combination measures) and scores 
from the gene combinations of different sizes. Subsequently, among all the p-values of 
a gene, the best is used as an integrated score of statistical significance. 

3. Integration of the above two procedures with GSEA More specifically, after 
procedures A and B, each gene has a single integrated score. Unlike traditional uni- 
variate scores, these iV integrated scores are based on both the univariate statistic and 
the gene combination measures. For the type of GSEA variations that depend on phe- 
notype permutation test, P +1 lists of TV integrative scores are computed, one for the 



thus, we consider only one univariate statistic rather than multiple of them. 
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real class labels and the other for the P permutations. For the type of GSEA variations 
that are based on gene-set permutation test, only the list of integrated scores for the 
real class labels are needed. An independent Matlab implementation of the proposed 
framework is available for download, which allows most existing GSEA frameworks [1] 
to directly utilize the proposed framework to handle gene combinations with almost- 
zero modification. 

4. Experimental results that illustrate the effectiveness of the proposed frame- 
work. We integrated three gene combination measures and the GSEA approach pre- 
sented in [28] and produced experimental results from four gene expression datasets. 
These results demonstrate that the integrative framework can discover gene sets that 
would have been missed without the consideration of gene combinations. This in- 
cludes statistically significant gene sets with moderate differential gene combinations 
whose individual genes have very weak discriminative power. Thus, a gene combina- 
tion assisted GSEA approach can improve traditional GSEA approaches by discovering 
additional disease-associated gene sets. Indeed, the integrative approach also improve 
traditional DGCS since most gene combinations are not statistically significant by 
themselves. Furthermore, we also show that the biologically relevant gene sets dis- 
covered by the integrative framework share several common biological processes and 
improve the consistency of the results among the three lung cancer data sets. 

Overview: The rest of the paper is organized as follow. In section [2j we describe three 
gene combination measures used in the following discussion and experiments. In Section [3j 
we present the technical details of the two procedures of the general integrative framework. 
Experimental design and results are presented in Section HJ followed by conclusions and 
discussions in Section [5j 



2 Differential Gene Combination Measures 

In this section, we describe three DGCS measures for use in the following discussion and 
experiments. Let A = jai, a 2 , . . . , a\A\ \ and B = j&i, & 2 , • • • , be two phenotypic classes 
of samples of size \A\ and \B\ respectively. For each sample in A and B, we have the ex- 
pression value of N genes G = {Gi, G%, . . . , Gjy}. First, we have the following two measures 
(denoted as Mi and M 2 ) defined for a pair of genes as presented in [ID] : 

Mi(Gi, Gj) = corr A (Gi,G 3 ) + corr B {G % , Gj) - corr AuB {G % , Gj) (1) 



M 2 (G i ,G j ) = corr A {Gi,Gj) - corrB{G u Gj) 



(2) 



where Gi and Gj are two genes, and corro(Gi, Gj) represents the correlation of Gi and Gj 
over the samples in set D. As discussed in [ID], Mi and M 2 can detect the joint differential 



expression of two genes as illustrated in Figure 2(a) and Figure 2(b) respectively. Mi and M 2 
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are used as two representative measures for gene pairs. Other options for gene combination 
measures for gene pairs have been investigated in [OS, [20] . 

We use the subspace differential coexpression measure (denoted as M 3 ) proposed in [T2] 
as the representative for measures for size-A: gene combinations, where k can be any integer 
(k>2). 

M 3 (a) = R A {a)-R B (a) (3) 

where a is a set of genes such that a C G and \a\ = k. Ra{o) and Rb(c() respectively 
represent the fraction of samples in A and B over which the genes in a are coexpressed. M3 
is a generalization of M 2 for detecting the differential coexpression of k genes (k > 2), i.e. 
the k genes are highly coexpressed over many samples in one class but over far fewer samples 
in the other. Other options for size-k combinations include the n-statistic [31], SupMaxPair 
[13], etc. 

Signal-to-noise ratio (denoted as M ) is used as the representative of traditional univariate 
statistics as in [281 [21] . 

M (G 1 ) = ^§±-^§1 (4) 

VA(<-*i) + o\B(OiJ 

where Ha(Gi) and /^(Gj) are the mean expression of Gi in class A and B respectively, 
and &A_(Gi) and o^Gj) are the standard deviation of the expression of Gi in class A and B 
respectively. Many other univariate statistics can be found in pp. 

In this paper, these four measures are used as representatives of each category for the 
illustration of the proposed integrative framework. However, the framework is general enough 
to handle other measures from each of these categories. 



3 Methods 

In Section [TJ we motivated the integration of DGCS with GSEA, discussed two challenges 
associated with this integration, and briefly described two main procedures in the proposed 
framework. In this section, we present the technical details of the two procedures and their 
integration with GSEA. 



3.1 Procedure A: combination-to-gene score reduction 

There are two steps in procedure A. In step (1), for each DGCS measure and each size-A; gene 
combination, its score is divided into k equal parts and assigned to each of the k genes in the 
combination. In step (2), the scores assigned to a gene from all the size-k combinations in 
which the gene participates are summarized into a single score by an aggregation functions 
such as max. Note that, for most univariate statistic and DGCS measures which can be 
either positive or negative (e.g. the four measures described in section [2]), the maximum 
is taken over the absolute values of the scores, and the sign of the score with the highest 
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absolute value is recorded for later use. Other simple statistics such as mean or median, or 
sophisticated ones such as weighted summation [33] can also be used. Since the focus of this 
paper is the overall integrative framework, we use max for simplicity. 

We provide a conceptual example of procedure A for a gene G\ with a certain DGCS 
measure M\. This example considers gene combinations up to size 4 for illustration purpose. 
The gene is associated with scores assigned from gene combinations of size 2, 3 and 4 (denoted 
as C 2 , C 3 and C 4 respectively) in which G\ participates. In step (2), the scores from C 2 , C 3 
and C4 are summarized by three maximum values, respectively. Please refer to the Appendix 
section for the illustration of this example. 

Procedure A serves as a general approach to summarize the scores of all the size-& 
combinations into N scores for the N genes. If we want to integrate GSEA with one DGCS 
measure and a specific size-&, procedure A by itself can enable most existing variations of 
GSEA to search, with almost-zero modification, for statistically significant gene sets with 
moderate but coordinated gene combinations of size-&. Such a GSEA approach can col- 
lectively consider the gene combinations affiliated with a gene set, and may provide better 
statistical power and better interpretability for DGCS, as will be shown in the experiments. 

3.2 Procedure B: Score-to-pvalue conversion and summarization 

The hypothesis tested when one DGCS measure, say M x , is integrated with GSEA (by pro- 
cedure A) is that, whether a gene set includes significantly many genes with highly positive 
(or highly negative) combination-based scores measured by M x . An extended hypothesis 
can be whether a gene set includes significantly many genes with highly positive (or highly 
negative) scores, either univariate or combination-based scores measured by different DGCS 
measures. The biological motivation of this extended hypothesis is that, a gene can be asso- 
ciated with the phenotype either as an univariate variable or together with other genes as a 
combination. To test this extended hypothesis, we design a second procedure (B) that can 
integrate the scores of a gene from different measures. 

Before describing the steps in this procedure. We first discuss in detail the challenges of 
integrating heterogeneous scores from different DGCS measures and combinations of different 
sizes. 

1. The different nature of different measures: Different measures are designed to 
capture different aspects of the discriminative power of a gene or a gene combination 
between the two phenotypic classes. Signal-to-noise ratio (Mo), a univariate gene- level 
statistic, measures the difference between the means of the expression of a gene in the 
two classes. In contrast, M 2 , a differential coexpression measure for a pair of genes 
describes the difference of the correlations of a gene-pair in the two classes. Thus, for 
a gene, the score of itself measured by Mq and the score assigned and summarized 
from the f^ 1 ) size-2 gene combinations measured by M 2 are not directly comparable. 
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Similarly, the scores of different DGCS measures can also have a different nature, e.g. 
Mi and M 2 as illustrated in figure [2j 

2. The different scales of different measures: Different measures also have different 
ranges of values. For example, the range of M , Mi, M 2 , and M 3 are [—00, 00], [—3, 3], 
[—2,2] and [— f, 1] respectively. Thus, they are not directly comparable. 

3. Differences in significance between different measures: Even after we normalize 
the scores of different measures to a single range, say [—1, 1], they are still not com- 
parable because the scores of different measures have different statistical significance. 
For example, a normalized Mo score of 0.8 may be less significant than a normalized 
Mi score of 0.5, if there are many genes with normalized Md score greater than 0.8 in 
the permutation test [28], but very few genes with normalized Mi score greater than 
0.5 in the permutation test. Note that, such differences in statistical significance also 
exists between gene combinations of different sizes, even for the same measure. Take 
the subspace differential coexpression measure M 3 as an example. A score of 0.5 for a 
size-2 combination may not be as significant as a score of 0.5 for a size-3 combination 
as discussed in [12j. 

To handle the above heterogeneity, we propose a score-to-pvalue transformation and sum- 
marization procedure that can enable the comparison and integration of the scores of different 
measures and combinations of different sizes. There are three major steps in procedure B. 

3.2.1 Step f: Score-to-pvalue transformation 

Consider a concrete example. For a gene Gi and a measure M 2 , procedure A computes a 
single summarized score. In this step, the original phenotype class labels are permutated say 
1000 times, and for each permutation, the same procedure A is applied, and a corresponding 
score for Gi and M 2 is computed. We denote the score of Gi and M 2 summarized with the 
original label as £^2,0, where i is the gene index, and 2 indicates the measure and means 
it is the score based on the original label. Similarly, we denote the scores computed in each 
of the permutation as Si^j, where 1 < j < 1000. 

These 1001 scores are organized in the table on the left in figure [3j The 1000 scores 
computed in the 1000 permutations can be considered as the null-distribution for gene Gi 
and measure M 2 , and a p- value can be estimated for Sj,2,o • Specifically, if *Si,2,o is positive, the 
p- value is the ratio of the number of scores that are greater or equal to *Si,2,o and the number 
of scores that are positive. Similarly if Si,2,o is negative, the p-value is computed as the ratio 
of the number of scores which are less or equal to £1,2,0 and the number of scores which 
are negative^ Note that, such a score-to-pvalue transformation is done for both Si,2,o and 
each of S i:2 j (1 < j < 1000), if the GSEA approach to be integrated is based on phenotype 

2 Treating positive and negative scores separately follows the practice of GSEA [30] 
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Figure 3: Illustration of step 1 in procedure B (score-to-pvalue transformation) for gene Gi 
and measure Mi- 



permutation test jSU]. Otherwise, only 6^2,0 needs to be transformed to p- value and will be 
used by the GSEA approaches that are based on gene-set permutation |30J. In this paper, 
we illustrate the proposed framework using the GSEA approach presented by Subramanian 
and Tamayo et al. [28] which is based on phenotype permutation test. 

Essentially, step 1 transforms the heterogeneous scores of a gene measured by different 
measures into their corresponding significance values, which are comparable to each other 
although their original values are not. 



3.2.2 Step 2: P-value Summarization 

Suppose that there are Q different measures to be integrated, one of which is a univariate 
statistic, and the others are different DGCS measures for which we consider combinations 
of sizes up to K. After step 1, each gene has a p-value for the univariate measure and up to 
K p- values for each size of gene combination for each measure. In step 2, the bestir p-value 
associated with a gene is selected as the integrated significance. 

Essentially, procedure B integrates the scores of different DGCS measures for a gene 
and the univariate statistic of the gene into a single p-value. Such a statistical significance- 

3 "Best" means it is the lowest raw p-value or the highest — log w transformed p-value 
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based integration of heterogeneous scores enables the comparison and thus the ranking of all 
the iV genes. However, this ranked list does not maintain the original directionality of the 
integrated scores of each gene. In particular, most univariate statistics and DGCS measures 
(e.g. all the four measures described in section [2]) can be either positive or negative. Such 
directionality information is lost in step 1 and 2 because the p-value is non-negative. Next, 
we describe a third step to maintain the directionality in the integration. 

3.2.3 Step 3: Maintaining directionality associated with the integrated p-values 

In the simple case, the measures to be integrated capture the same type of differentiation 
between the two phenotype classes, e.g. M 2 and M 3 . Suppose there are two genes Gi and 
Gj, whose integrated p-values are transformed respectively from two scores measured by M 2 
and M3 in step 2. The signs of these two scores are comparable to each other, because both 
M 2 and M 3 capture the change of coexpression of a combination of genes. Thus, we simply 
use the signs of these two scores as the signs associated with the integrated p-values of Gi 
and Gj. Similarly, we associate a sign to all the iV integrated p-values. And these N p-values 
with associated signs can be used to rank the N genes based on their significance as well as 
their direction of differentiation, i.e. p-values associated with positive signs are ranked with 
descending significance, and afterwards, p-values associated with negative signs are ranked 
with increasing significance. 

In the other case, if the measures to be integrated capture different types of differentiation 
between the two phenotype classes, the directionality can not be fully maintained. For 
example, suppose there are two genes Gi and Gj, whose integrated p-values are transformed 
respectively from two scores measured by M and M 2 in step 2. The signs of these two scores 
are not comparable, because M captures the change of mean expression, and M 2 captures 
the change of coexpression of a combination of genes. Specifically, up-regulation of Gi can 
be associated with either high or low coexpression of another gene-combination in which 
Gj participates. Thus, it is not reasonable to follow the same strategy to associate signs 
to the N integrated p-values. If we know the correspondence of the signs of different genes 
in advance, e.g. the up-regulation of G\ is associated with the low coexpression of G 2 and 
G3, then the signs can be maintained. However, because it is not realistic to assume such 
prior knowledge, we propose the following heuristic approach which has proved a workable 
solution for our initial experiments. Specifically, since the focus of step B is to integrate 
different DGCS measures in addition to the univariate statistic M , we considered M as 
the base measure. For the integrated p-values that are transformed from scores measured 
by Mo in step 2 (say there are w of them), we use the signs of these w Mq scores for the w 
integrated p-values. For the signs of the other N — w genes, we assign positive signs to all 
of them once and negative signs to all of them a second time. Correspondingly, we have two 
ranked lists similar to the simple case described above. 

Note that, if the directionality of differential measures can be preserved, the power of 
this approach will be enhanced. To deal with the situation where signs are not comparable, 
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other approaches will be explored. 



3.3 Integration with GSEA 

From the above description of procedure A and B, we know that, if only one DGCS measure 
is used in the GSEA framework, only procedure A is needed. If one or multiple DGCS 
measures are integrated together with the univariate statistic M in the GSEA framework, 
procedure B is needed in addition. In the first case, the integrative framework outputs 
a ranked list of N scores with associated signs for the original class label, and 1000 lists 
corresponding to the 1000 permutation tests. In the second case, we have two sets of 1001 
lists respectively for the two rounds of maintaining directionality in step 3 in procedure B. 

In either case, the 1001 ranked lists along with the appropriate parameter settings and 
specification of gene sets can be used to run GSEA. The only modification to GSEA is the 
elimination of the initial GSEA step to generate the scores, simulated and actual, that mea- 
sure the level of differentiation between genes across different phenotypes. The proposed inte- 



grative framework is implemented as a Matlab function (available at http: / / vk.cs.umn.edu/ICG/ ), 
independently from the GSEA framework to be integrated in this paper [28]. As summarized 
by Ackermann and Strimmer p], hundreds of variations of GSEA are being used by differ- 
ent research groups. This independently implemented integrative framework can be easily 
applied to other variations of GSEA. 



3.4 A further technical detail 

In our experiments, in order to have a fair comparison, we transform the 1001 ranked lists into 
the exact sample distribution as the original lists corresponding to M ffiGSEA. Specifically, 
we only use the ranking information in the 1001 integrated ranked lists and map to them 
to the values in the original lists based on M ©GSEA. Essentially, the values in the ranked 
list passed to the GSEA framework are exactly the same among M ©GSEA, M i©GSEA, 
M 02 ffiGSEA, M 03 ©GSEA and M i 23 ©GSEA, while the only difference is that the N genes 
have different ranks in the lists. Such a mapping ensures that the additionally discovered gene 
sets are because of the integration of gene-combinations in addition to univariate statistic, 
rather than simply the different value distributions in the 1001 lists. 



4 Results 

In this section, we present the experimental design and results for the evaluation of the 
proposed integrative framework. We first provide a brief description of the data sets and 
parameters used in the experiments. Second, we describe and discuss the comparative ex- 
periments to study whether the integration of DGCS and GSEA (denoted as DGCS©GSEA) 
improves both DGCS and GSEA. The two major evaluation criteria are the statistical power 
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Table 1: Number of gene combinations with FDR less than 0.25 discovered from the four 
data sets by each combination measure 

to discover (additional) significant results, and the consistency of the results across different 
datasets for the study of the same phenotype classes. 

4.1 Data sets 

The four datasets used in the experiments are described as follows: 

1. Three lung cancer datasets respectively denoted as Boston [4], Michigan [3] and Stand- 
ford [14]: all the three data sets consist of gene-expression profiles in tumor samples 
from respectively 62, 86 and 24 patients with lung adenocarcinomas and provide clini- 
cal outcomes (classified as "good" or "poor" outcome). The two phenotypic classes in 
these three datasets are denoted as A and D as in [28] . 

2. A data set from the NCI-60 collection of cancer cell lines for the study of p53 status [25] 
(denoted as P53 data set): the mutational status of the p53 gene has been reported 
for 50 of the NCI-60 cell lines, with 17 being classified as normal and 33 as carrying 
mutations in the gene. The two phenotypic classes in this dataset are denoted as MUT 
and WT as in [28]. 

All four datasets were downloaded from the GSEA website3[28j , and were already pre- 
processed as described in the supplementary file of [28]. For all four data sets, we use the 
gene sets from C 2 in MSigDB 4 as in [28J, as well as the same parameters. 

4.2 Differential gene-combination measures 

We consider one univariate statistic (M ), and three gene-combination measures (M 1; M 2 
and M 3 ) in our experiments. These four measures are described in section [2J M\ and M 2 
are defined only for size-2 combinations. For M3, we considered gene-combinations of size-2 
and size-3 for the illustration of concept. 
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Table 2: Number of gene sets with FDR less than 0.25 discovered from the four datasets by 
integrating GSEA with each of the three combination measures. One or multiple biological 
process(es) are indicated as superscript, from which we can observe the consistency across 
three lung cancer data sets. A: apoptosis related pathways; H: responses to hypoxia; S: 
sppaPathway; I: insulin-signaling sets; X: oxidative-phosphorylation related sets; P: p53- 
related sets. The names of the discovered gene sets and their FDRs are available in the 
Appendix section. 

4.3 Ql: Does GSEA-assisted DGCS improve traditional DGCS? 

In this section, we study whether the question (Ql) of whether integration of DGCS and 
GSEA can improve traditional DGCS. For this comparison, we consider the integration 
of DGCS and GSEA as a GSEA-assisted DGCS approach. We first apply the traditional 
DGCS approaches on the four datasets to find statistically significant gene-combinations. We 
denote the three DGCS approaches respectively with the names of the three measures, i.e. 
Mi, M 2 and M 3 . Second, we apply the integrative framework, in which GSEA is integrated 
respectively with the three DGCS measures, to find statistically significant gene sets with 
moderate but coordinated differential gene-combinations. We denote the three instances 
of the integrative approach respectively as Mi©GSEA, M 2 ®GSEA and M 3 ©GSEA. Then, 
we compare the results of M 1 , M 2 and M 3 , respectively with the results of Mi ©GSEA, 
M 2 ffiGSEA and M 3 ©GSEA. 

In this section, we study whether the question (Ql) of whether integration of DGCS 
and GSEA can improve traditional DGCS. For this comparison, we consider the integration 
of DGCS and GSEA as a GSEA-assisted DGCS approach. We first apply the traditional 
DGCS approaches on the four datasets to find statistically significant gene-combinations. We 
denote the three DGCS approaches respectively with the names of the three measures, i.e. 
Mi, M 2 and M 3 . Second, we apply the integrative framework, in which GSEA is integrated 
respectively with the three DGCS measures, to find statistically significant gene sets with 
moderate but coordinated differential gene-combinations. We denote the three instances 
of the integrative approach respectively as Mi ©GSEA, M 2 ffiGSEA and M 3 ©GSEA. Then, 
we compare the results of Mi, M 2 and M 3 , respectively with the results of Mi ©GSEA, 
M 2 ©GSEA and M 3 ffiGSEA. 

Table 1 lists the number of statistically significant gene combinations discovered respec- 
tively by the three measures on each of the four datasets, with an FDR threshold of 0.25. 
Table 1 lists the number of statistically significant gene sets discovered by integrating GSEA 

4 http://www. broadinstitute.org/gsea/ 
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respectively with the three DGCS measures on each of the four datasets, also with the same 
FDR threshold of 0.25. Three major observations can be made by comparing the two tables: 

4.3.1 GSEA-assisted DGCS has better statistical power than traditional DGCS 

Table 1 shows that, in most cases, traditional DGCS discovers very few (less than 3) statisti- 
cally significant gene combinations (although M2 and M3 have 645 and 10 gene-combinations 
on the Boston data set, none of them have FDR lower than 0.10). In contrast, table 2 shows 
that the integration of GSEA with the three combination measures discover multiple signif- 
icant gene sets in most of the cases. This difference implies that the discovered statistically 
significant gene sets include many moderate but coordinated differential gene combinations, 
even though the combinations are not significant by themselves as shown in table 1. This 
comparison demonstrates that traditional DGCS, similar to univariate gene analysis, has 
limited statistical power, and DGCS©GSEA can increase that power. 

4.3.2 GSEA-assisted DGCS has better result consistency than traditional DGCS 

We further compare DGCS and DGCS©GSEA by studying the consistency of their results 
on the first three data sets that are all from lung cancer studies, as done in [28]. For DGCS, 
Mi discovered 2 genes on Michigan but nothing from Boston and Stanford; M2 discovered 
645 combinations on Boston but only 1 and 2 from Michigan and Stanford, respectively, 
and there are no common ones between the 645, 1, and 2 gene combinations; M 3 discovered 
10 genes on Boston but only 1 gene on Michigan and nothing from Stanford, and the 10 
and 1 combinations do not overlap. The inconsistent results make the follow-up biological 
interpretation very difficult. 

In contrast, when the three DGCS measures are integrated with GSEA, several consis- 
tent themes can be observed: (i) Apoptosis related pathways (marked by A in table 2): 
Mi ©GSEA discovered four gene sets on Boston, three of which are known to be closely 
related to cancer and specifically to apoptosis, i.e. nfkbpathway, ST- Gaq- Pathway and TNF- 
Pathway. This apoptosis theme is shared by the gene sets discovered by Mi-GSEA from 
Michigan and Stanford, i.e. Monocyte- AD -Pathway, hivnej Pathway, deathPathway and cas- 
pasePathway. These apoptosis related pathways are enriched with the lung cancer samples 
with good outcome, which makes sense biologically and also corresponds to the proliferation 
theme supported by the gene sets enriched with the samples with poor outcome as reported 
in [2S] ■ Several other examples of the result consistency, as indicated by other superscripts in 
Table 2, are in the technical report. This comparison demonstrates that traditional DGCS, 
like univariate gene analysis, has poor result consistency across the three lung cancer data 
sets, and DGCS©GSEA can improve its consistency by integrating DGCS measures with 
GSEA. 
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4.3.3 GSEA-assisted DGCS with different DGCS measures complement each 
other 

The number of significant gene sets discovered by the three versions of GSEA varies, i.e. 
Mi©GSEA and M 2 ©GSEA discovered a bit larger number of significant gene sets than 
M 3 ©GSEA. However, M 3 ©GSEA still discovered several gene sets that are not discovered 
by Mi©GSEA or M 2 ©GSEA, e.g. one gene set from the Michigan data set and three from 
the Stanford data set. This indicates that Mi ©GSEA, M 2 ©GSEA and M 3 ©GSEA have 
complementary perspectives, i.e. different combination measures capture different aspects of 
the difference between the phenotype classes (recall the two types of combinations in Figure 
[2]). This also demonstrates the proposed framework is general enough to integrate any type 
of DGCS with GSEA. 

4.4 Q2: Does DGCS-assisted GSEA improve GSEA? 

In this Section, we want to answer the question (Q2) of whether the integration of DGCS and 
GSEA can improve traditional GSEA. For this comparison, we consider the integration of 
DGCS and GSEA as a DGCS-assisted GSEA approach. We design three sets of comparisons. 
Firstly, we compare the traditional univariate-statistic based GSEA (denoted as M ffiGSEA) 
with the integrative framework where one gene-combinations measure is used instead of 
M . Specifically, we compare the gene sets discovered by M ©GSEA with the gene sets 
discovered by Mi ©GSEA, M 2 ©GSEA and M 3 ©GSEA. Then, we compare M ©GSEA with 
the integrative framework where one gene-combinations measure is used in addition to Mo, 
i.e. M i©GSEA, M 02 ffiGSEA and M 03 ffiGSEA. Furthermore, we also study the integration 
of multiple gene-combinations measure in addition to M , e.g. M i 23 ©GSEA. 

Figure H] displays the statistically significant gene sets discovered with different (combi- 
nations of) measures respectively from the four datasets. An FDR threshold of 0.25 is used 
as in [2S] for comparison purpose. The results presented in [2H] are exactly reproduced, 
i.e. the gene sets listed in the rows corresponding to M ffiGSEA. In each of these four fig- 
ures, we consider the traditional univariate-statistic based GSEA (Mo©GSEA) as the base- 
line, and compare it with the rows corresponding to Mi©GSEA, M 2 ©GSEA, M 3 ©GSEA, 
M i©GSEA, M 02 ©GSEA, M 03 ©GSEA and M 0i i i2j3 ©GSEA. From these comparisons, the 
following observations can be made. 

4.4.1 DGCS-assisted GSEA discovers additional significant gene sets 

First, we compare the rows corresponding to Mi©GSEA, M 2 ©GSEA, M 3 ©GSEA with the 
rows corresponding to M ©GSEA. We bolded the additional gene sets that are only discov- 
ered by MiffiGSEA, M 2 ©GSEA, M 3 ©GSEA. For example, with M ©GSEA, no statistically 
significant gene sets have been enriched with class A in the Boston data set. In contrast, 
Mi©GSEA discovered 4 gene sets, three out of which (discussed in Ql) are related to apop- 
tosis which is consistent with the results on Michigan and Stanford. On the Michigan 
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dataset, M 2 ©GSEA discovered a gene set beta-Alanine-metabolism that is not discovered by 
M ffiGSEA. This gene set is related to the responses of hypoxia, which is consistent with 
the results on Boston and Stanford. It is worth noting that, although most studies did not 
report statistically significant gene sets on the Stanford dataset due to the very small sample 
size, Mi©GSEA, M 2 ©GSEA, M 3 ffiGSEA respectively discovered 4, 4 and 3 significant gene 
sets. These additional gene sets were discovered because the three DGCS measures capture 
different types of the differentiation between the two phenotype classes, compared to the 
traditional univariate differential expression-based GSEA. 

Second, we compare the rows corresponding to M i©GSEA, M 02 ©GSEA, M 3©GSEA 
with the rows corresponding to M ffiGSEA. We bolded the additional gene sets that are 
only discovered by the integrative approach. For example, on the Boston data set, M based 
GSEA discovered 8 gene sets. In addition, M 01 ©GSEA discovered the proteasomePathway 
gene set, and M 02 ©GSEA discovered the p53- signaling gene set. Both ubiquitin-proteasome 
pathway and p53-signaling pathway are well-known cancer-related pathways that are also 
specifically related to lung cancer [2TJ E|. (Additional examples are in the technical report.) 
The gene sets that are discovered by DGCS-assisted GSEA but not by M -GSEA illustrate 
the benefits of using DGCS to assist GSEA. 

Next, we also observed that integrating multiple DGCS measures can further discover 
statistically significant gene sets. For illustration purpose, we compare the rows correspond- 
ing to M i©GSEA, M 02 ©GSEA, M 03 ffiGSEA with the rows corresponding to M i 23 ©GSEA. 
M)i23©GSEA discovers the gSPathway gene set and the gskSPathway gene set, respectively 
from the Boston and the Michigan dataset. Neither of these two pathways are discovered 
by M ©GSEA, M i©GSEA, M 02 ©GSEA and M 03 ©GSEA. The curated gene set g2Pathway 
contains the genes related to the G2/M transition, which is shown to be regulated by p53 
|29j . a well-known cancer- related gene. The curated gene set gskSPathway is the signaling 
pathway of GSK-3-/3, which has been shown to be related to different types of cancer [5| 122]. 
These two cancer-related pathways are discovered by M i 2 3©GSEA but not by M ffiGSEA, 
M i©GSEA, M 02 ©GSEA and M 03 ffiGSEA. This indicates that different members of these 
two pathways are differential between the two phenotype groups in different manners, i.e. 
the differentiation of some genes is captured by M , some by Mi, some by M 2 and some by 
M 3 . These two pathways can be discovered to be statistically significant only when these 
measures are used together in the integrative framework. This demonstrates the benefits of 
the proposed framework for integrating multiple DGCS measures with a univariate measure. 

It is worth noting that, the gene sets discovered by the integrative framework with multi- 
ple measures are not necessarily a superset of those discovered by integrating each individual 
measure with GSEA since, when different DGCS measures are integrated with GSEA, the 
null- hypotheses tested in the GSEA framework are correspondingly different. The highlight 
of the integrative framework is that, additional gene sets can be discovered when differ- 
ent DGCS measures are used to assist the traditional univariate statistic-based GSEA. In 
practice, these different versions of GSEA should be used collectively. 
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4.4.2 DGCS-assisted GSEA discovers gene sets with lower FDRs 



Even when a gene set is discovered both before and after a DGCS measure is integrated 
into the framework, we can observe several interesting cases where the FDR of a gene set 
becomes much lower after the integration. We bolded the FDRs that significantly decreased 
when they are discovered by the integrative approach. For example, M i23©-GSEA, in 
which Mo, M\, M2 and M3 are integrated together, discovers p53hypoxialPathway with an 
much lower FDR of 0.00095, two-order lower than Mo-GSEA. This example indicates that 
several members of p53hypoxialP athway have weak individual differentiation measured by 
Mq, but have more significant differentiation when they are measured by M 3 . This and 
other similar examples demonstrates the benefits of the proposed framework for integrating 
multiple DGCS measures. 

4.4.3 DGCS-assisted GSEA further improve the consistency across the three 
lung cancer data sets 

As presented in [28J, MoffiGSEA discovered 8 and 11 gene sets respectively from the Boston 
and Michigan data sets, and 5 of the 8 in Boston and 6 of the 11 in Michigan are common. 
The three unmatched gene sets that are discovered in Boston but not in Michigan are GLUT- 
DOWN, LEU-DOWN and CellCycleCheckpoint. Interestingly, the latter two are discovered 
from both the Boston and the Michigan data sets by M m ©gseaS Such observations suggest 
that DGCS-assisted GSEA also provides new insights to the consistency between different 
data sets. 

4.4.4 Additional issues of multiple hypothesis testing 

Because different combinations of measures are used in the integrative framework, addi- 
tional issues of multiple hypothesis testing arise, even though multiple hypothesis test- 
ing has been addressed for each measure via the phenotype permutation test procedure 
in the GSEA framework proposed in [28]. To investigate this, we designed experiments 
with 4 of the 15(= 2 4 - 1) possibilities of integrations, i.e. M i©GSEA, M 02 ©GSEA, 
M 03 ©GSEA and M 012 3©GSEA. Even using a collective (meta-level) multiple hypothesis 
correction, many discovered gene sets would still be significant. For examples, M i23©GSEA 
discovers p53hypoxialP athway from the Boston data set with a low FDR of 0.00095, and 
Moi23©GSEA discovers deathP athway from the Michigan data set with a lower FDR of 
0.00197. We also did additional permutation tests, in which we generate random gene sets 
with the same sizes as the sets in MSigDB C2, and do the same set of experiments as shown 
in Figure HI The FDR values of the random gene sets computed in the integrative framework 
are mostly insignificant (higher than 0.25). 

5 The CellCyclePathway discovered on Michigan and the cell-cycle-checkpoint discovered on Boston are 
both cell-cycle related gene sets 
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5 Discussion 



In this paper we motivated the integration of differential gene-combination search and gene 
set enrichment analysis for bi-directional benefits on both them. We proposed a general 
integrative framework that can handle gene-combinations of different sizes (k > 2) and 
different gene-combination measures in addition to an univariate statistic used in traditional 
GSEA. The experimental results demonstrated that, on one hand, GSEA-assisted DGCS 
has better statistical power and result consistency than traditional DGCS. On the other 
hand, DGCS-assisted GSEA can discover additional statistically significant gene sets that 
are ignored by traditional GSEA and further improve the result consistency of the traditional 
GSEA. 

The proposed framework can be extended in several ways. Different variations of GSEA 
will be considered. Along these lines, we note that the proposed integrative framework is 
general enough to integrate most existing variations of GSEA approaches summarized in pQ 
with minimal amount of modification. Also, it should be possible to integrate DGCS and 
gene-subnetwork discovery. Both GSEA and gene-subnetwork discovery [T7J [8] can discover 
collections of genes, either known gene sets [28] or subnetworks in a molecular network (e.g. 
protein interaction network), that show moderate but coordinated differentiation. In this 
paper, we integrate DGCS and GSEA as an illustration of the general framework for inte- 
grating scores from different gene-combination measures and gene-combinations of different 
sizes, in addition to the traditional univariate statistic, but the same framework also ap- 
plies to the integration of DGCS and gene subnetwork discovery. Another direction is the 
use of this framework for the analysis of (GWAS) SNP data, by following the methodology 
proposed in recently work on pathway/network based analysis of GWAS datasets [321 [2]. Fi- 
nally, it may be possible to use constraints on gene-combinations to improve our framework. 
In procedure A, for each gene-combination measure and an integer k, the score of a gene is 
assigned from all the f^Zi) possible gene-combinations. A further extension of procedure A 
is to only consider the gene combinations, in which the k genes appear in a common gene 
set, e.g. a pathway. Such gene-set-based constraints may better control false positive gene 
combinations and improve the statistical power of the whole integrative framework. 

6 Appendix 

6.1 Illustration of procedure A 

The illustration of procedure A is in figure [5j 

6.2 Complete Gene-set Table 

Due to the space limit, Table 2 and Figure H] are both summarized from the four complete 
tables that are available at http://vk.cs.umn.edu/ICG/. Specifically, Table 2 is a high-level 
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summary of the number of gene sets discovered and the biological processes associated with 
each of the gene sets. In figure HJ we listed the complete results for M ©GSEA (the baseline), 
while for the other rows, we only list a gene set if it is only discovered by the integrative 
approach (with bolded name), or it has a non-trivially decreased FDR when it is discovered 
by the integrative approach (with bolded FDR). 
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Figure 4: Common captions for the four tables: Statistically significant gene sets 
discovered by different (combinations of) measures from each of the four data sets. The 
first row of each table shows the name of the data set, and the second row indicates the 
two phenotype classes in the data set that a gene set can be enriched with. The first 
column indicates the measures used in the integrative framework. For each data set and 
each (combination of) measure(s), we list the names of the statistically significant gene sets 
and the corresponding FDRs for both the classes. The traditional univariate-statistic based 
GSEA (M0©GSEA) is considered as the baseline. For the other rows, we only list a gene 
set if it is only discovered by the integrative approach (with bolded name), or it has a 
non-trivially decreased FDR when it is discovered by the integrative approach (with bolded 
FDR). Please refer to the Appendix section for the complete tables. 
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Figure 5: Illustration of procedure A (combination-to-gene score assignment) for G\. The 
three ellipses represent the three gene combinations that G± participates to with respect to 
measure M\. 
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